Open in Binder Browser Extension

Today I am pleased to announce the release of the first project I've been working on for about a week: a Firefox extension that opens the GitHub repository you are visiting on MyBinder.org.

If you are in a hurry, just head over there to install version 0.1.0 for Firefox. If you'd like to know more, read on.

Binder Logo

Back to Firefox.

I've been using Chrome for a couple of years now, but I heard a lot of good things about Rust and all the good stuff it has done for Firefox. OK, that's a bit of marketing, but it got me to retry Firefox (Nightly, please), and except for my password manager, which took a few weeks to update to the new Firefox API, I rapidly found myself barely using Chrome.

Firefox Nightly Logo

MyBinder.org

I'm also spending more and more time working with the JupyterHub team on Binder, and I see more and more developers adding Binder badges to their repositories. In the middle of last week I thought:

You know what's not optimal? It's painful to browse repositories that don't have the Binder badge on MyBinder.org, and sometimes you have to hunt for the badge, which is at the bottom of the readme.

You know what would be great to fix that? A button in the toolbar doing the work for me.

Writing the extension

As I know Mozilla (which has a not-so-great new design BTW, but that's a personal opinion) cares about making standards and keeping things simple for their users, I thought I would have a look at the new WebExtensions API.

And 7 days later, after a couple of 30-minute breaks, I present to you a staggering 27-line (including 7 lines of business logic) extension that does just that:

(function() {
  function handleClick(){
    browser.tabs.query({active: true, currentWindow: true})
    .then((tabs) => {return tabs[0]})
    .then((tab) => {
      let url = new URL(tab.url);
      if (url.hostname != 'github.com'){
        console.warn('Open in binder only works on GitHub repositories for now.');
        return;
      }
      let parts = url.pathname.split('/');
      if (parts.length < 3){
        console.warn('While you are on GitHub, You do not appear to be in a github repository. Aborting.');
        return;
      }
      let my_binder_url = 'https://mybinder.org/v2/gh/' + parts[1] + '/' + parts[2] + '/master';
      console.info('Opening ' + my_binder_url + ' using mybinder.org... enjoy !');
      browser.tabs.create({'url':my_binder_url});
    })

  }
  console.info('(Re) loading open-in-binder extension.');
  browser.browserAction.onClicked.addListener(handleClick);

  console.info('❤️ If you are reading this then you know about binder and javascript. ❤️');
  console.info('❤️ So you\'re skilled enough to contribute ! We\'re waiting for you on https://github.com/jupyterhub/ ❤️');
})()

You can find the original source here

Firefox Dev Logo

The hardest part was finding the right API and learning how to package the extension and set the icons correctly. There are still plenty of missing features and really low-hanging fruit, even if you have never written an extension before (hey, it's my first, and I averaged 1 useful line per day writing it...).
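For readers who want to try packaging one themselves, a minimal manifest.json for this kind of extension might look like the following. This is a sketch from memory, not the extension's actual manifest; the file names and icon paths are illustrative:

```json
{
  "manifest_version": 2,
  "name": "open-in-binder",
  "version": "0.1.0",
  "description": "Open the current GitHub repository on mybinder.org",
  "icons": { "48": "icons/binder.png" },
  "permissions": ["activeTab", "tabs"],
  "background": { "scripts": ["background.js"] },
  "browser_action": {
    "default_icon": "icons/binder.png",
    "default_title": "Open in Binder"
  }
}
```

The `browser_action` entry is what puts the button in the toolbar, and the `tabs` permission is needed for the `browser.tabs.query` call in the script above.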

General Feeling

Remember that I'm new to this and started a week ago.

The Mozilla docs are good but highly variable in quality; it feels like (and is) a wiki. More opinionated tutorials might have been less confusing. A lot of statements are correct but not quite, and leaving the choice to users is just confusing. For example: you can use SVG or PNG icons, which I did, but then some areas don't like SVG (addons.mozilla.org), and WebExtensions should work on Chrome, but Chrome requires PNG. Telling me that I could use SVG was not useful.

The review of addons is blazingly fast (7 minutes from first submission to human approval). Apple could learn from that, if what I've heard here and there is correct.

The submission process has way too many manual steps. I'm OK with that for a first submission, but for updates, really? I want to be able to fill in all the information ahead of time (or generate it) and then have a CLI to submit things. I hate filling in forms online.

The first submission, even if marked Beta, will not be considered a beta. So basically I published a 0.1.0beta1, then a 0.1.0beta2, which did not trigger an automatic update because beta1 was not considered a beta. Super confusing. I could "force" my way to the beta3 page, but with a warning that beta3 was an older version than beta1? What?

There is still this feeling that the last 1% of polishing the process has not been done (that's usually where Apple is known to shine). For example, your store icon will be resized to 64x64 (px) and displayed in a 64x64 (px) square, but I have a retina screen! So even though I submitted a 128x128 icon, it now looks blurry! WTF!

You can contribute

As I said earlier, there is a lot of low-hanging fruit! I went through the process of figuring things out, so that you can contribute easily:

  • Detect when not on /master/ and craft the corresponding Binder URL
  • Switch Icons to PNGs
  • test/package for Chrome
  • Add options for other binders than MyBinder.org
  • Add Screenshots and descriptions to the Addon Store.

So see you there !

JupyterCon - Display Protocol

This is an early preview of what I am going to talk about at JupyterCon.

Leveraging the Jupyter and IPython display protocol

This is a small essay to show how one can make better use of the display protocol. Everything you will see in this blog post has been available for a couple of years, but no one has really built on top of it.

It is widely known that the IPython rich display mechanism allows library authors to define rich representations for their objects. You may have seen it in SymPy, which makes extensive use of the LaTeX representation, and Pandas, whose DataFrames have a nice HTML view.

What I'm going to show below is that one is not limited to these – you can alter the representation of any existing object without modifying its source – and that this can be used to alter the view of containers, with the example of lists, to make things easier to read.

Modifying objects' reprs

This section is just a reminder of how one can define representations for objects whose source code is under your control. When defining a class, the code author needs to define a number of methods which should return the (data, metadata) pair for a given mimetype. If no metadata is necessary, it can be omitted. For some common representations, short method names are available. These methods can be recognized as they all follow the pattern _repr_*_(self): an underscore, followed by repr, followed by an underscore. The star * is replaced by a lowercase identifier, often referring to a short human-readable description of the format (e.g. png, html, pretty, ...), and the name finishes with a single underscore. Note that unlike Python's __repr__ (pronounced "dunder repper"), which starts and ends with two underscores, the "rich reprs" or "reprs-stars" start and end with a single underscore.
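To make the naming convention concrete, here is a small introspection helper (purely illustrative, not part of IPython) that finds the rich-repr methods a class defines:

```python
import re

def rich_repr_methods(obj):
    """List the single-underscore _repr_*_ methods defined for obj."""
    pattern = re.compile(r'^_repr_[a-z]+_$')
    return sorted(name for name in dir(type(obj)) if pattern.match(name))

class Example:
    def __repr__(self):            # classic dunder repr -> "text/plain"
        return "Example()"
    def _repr_html_(self):         # rich repr -> "text/html"
        return "<b>Example</b>"

print(rich_repr_methods(Example()))  # ['_repr_html_']
```

Note that the pattern deliberately excludes __repr__ itself: the double leading underscore does not match the single-underscore convention.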

Here is the class definition of a simple object that implements three of the rich representation methods:

  • "text/html" via the _repr_html_ method
  • "text/latex" via the _repr_latex_ method
  • "text/markdown" via the _repr_markdown_ method

None of these methods returns a tuple, thus IPython will infer that there is no metadata associated.

The "text/plain" mimetype representation is provided by the classical Python's __repr__(self).

In [1]:
class MultiMime:
    
    def __repr__(self):
        return "this is the repr"
    
    def _repr_html_(self):
        return "This <b>is</b> html"
    
    def _repr_markdown_(self):
        return "This **is** markdown"

    def _repr_latex_(self):
        return "$ Latex \otimes mimetype $"
In [2]:
MultiMime()
Out[2]:
This is html

All the mimetype representations will be sent to the frontend (in many cases the notebook web interface), and the richest one will be picked and displayed to the user. All representations are stored in the notebook document (on disk), so the view can be chosen when the document is later reopened – even with no kernel attached – or converted to another format.

External formatters and containers

As stated in the introduction, you do not need control over an object's source code to change its representation – still, having control is often more convenient. As an example, we will build a container for image thumbnails, and see how we can use the code written for this custom container to apply it to generic Python containers like lists.

As a visual example we'll use O'RLY parody book covers – in particular, a small-resolution version of some of them, to limit the amount of data we'll be working with.

In [3]:
cd thumb
/Users/bussonniermatthias/dev/posts/thumb

Let's see some of the images present in this folder:

In [4]:
names = !ls *.png
names[:20], f"{len(names) - 10} more"
Out[4]:
(['10x-big.png',
  'adulting-big.png',
  'arbitraryforecasts-big.png',
  'avoiddarkpatterns-big.png',
  'blamingthearchitecture-big.png',
  'blamingtheuser-big.png',
  'breakingthebackbutton-big.png',
  'buzzwordfirst-big.png',
  'buzzwordfirstdesign-big.png',
  'casualsexism-big.png',
  'catchingemall-big.png',
  'changinstuff-big.png',
  'chasingdesignfads-big.png',
  'choosingbasedongithubstars-big.png',
  'codingontheweekend-big.png',
  'coffeeintocode-big.png',
  'copyingandpasting-big.png',
  'crushingit-big.png',
  'deletingcode-big.png',
  'doingwhateverdanabramovsays-big.png'],
 '63 more')

In the above I've used an IPython-specific syntax (!ls) to conveniently extract all the files with a png extension (*.png) in the current working directory, and assign the result to the names variable.
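Without IPython, roughly the same listing can be done with the standard library. This is a plain-Python stand-in for the !ls line, not what the notebook actually runs:

```python
import glob
import os

def list_pngs(directory='.'):
    """Plain-Python equivalent of IPython's `names = !ls *.png`."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(directory, '*.png')))
```

Unlike !ls, the result is a plain list of strings rather than IPython's SList, but for our purposes the two are interchangeable.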

That's cute but, for images, not really useful. We know we can display images in the Jupyter notebook when using the IPython kernel; for that we can use the Image class located in the IPython.display submodule. We can construct such an object simply by passing the filename. Image already provides a rich representation:

In [5]:
from IPython.display import Image
In [6]:
im = Image(names[0])
im
Out[6]:

The raw data from the image file is available via the .data attribute:

In [7]:
im.data[:20]
Out[7]:
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x90'

What if we map Image over each element of a list?

In [8]:
from random import choices
mylist = list(map(Image, set(choices(names, k=10))))
mylist
Out[8]:
[<IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>,
 <IPython.core.display.Image object>]

Well, unfortunately a list object only knows how to represent itself using text and the text representation of its elements. We'll have to build a thumbnail gallery ourselves.

First let's (re)build an HTML representation for displaying a single image:

In [9]:
import base64
from IPython.display import HTML
def tag_from_data(data, size='100%'):
    return (
        '''<img
             style="display:inline;
                    width:{1};
                    max-width:400px;
                    padding:10px;
                    margin-top:14px"
             src="data:image/png;base64,{0}"
           />
           ''').format(''.join(base64.encodebytes(data).decode().split('\n')), size)

We encode the data from bytes to base64 (newline-separated) and strip the newlines. We format that into an HTML template – with some inline style – and set the source (src) to be this base64-encoded string. We can check that this displays correctly by wrapping the whole thing in an HTML object that provides a convenient _repr_html_.
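A note on the newlines: base64.encodebytes wraps its output every 76 characters, which is why we strip them before embedding the string in the src attribute. A quick round-trip check:

```python
import base64

payload = bytes(range(100))                     # arbitrary binary data
encoded = base64.encodebytes(payload)           # base64, wrapped every 76 chars
assert b'\n' in encoded                         # encodebytes inserts newlines

# The same stripping as in tag_from_data: join the lines back together.
flat = ''.join(encoded.decode().split('\n'))
assert '\n' not in flat
assert base64.b64decode(flat) == payload        # still decodes to the original
```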

In [10]:
HTML(tag_from_data(im.data))
Out[10]:

Now we can create our own subclass, which takes a list of images, constructs an HTML representation for each of them, then joins them together. We define a _repr_html_ that wraps it all in a paragraph tag, and adds a comma between each image:

In [11]:
class VignetteList:
    
    
    def __init__(self, *images, size=None):
        self.images = images
        self.size = size
        
    def _repr_html_(self):
        return '<p>'+','.join(tag_from_data(im.data, self.size)  for im in self.images)+'</p>'
    
    def _repr_latex_(self):
        return '$ O^{rly}_{books} (%s\ images)$ ' % (len(self.images))
        

We also define a LaTeX representation – which we will not use here – and look at our newly created object using the previously defined list:

In [12]:
VignetteList(*mylist, size='200px')
Out[12]:

, , , , , , , ,

That is nice, though it forces us to explicitly unpack all the lists we have into a VignetteList – which may be annoying. Let's clean up the above a bit, and register an external formatter for the "text/html" mimetype that will be used for any object which is a list. We'll also improve the formatter to recurse into objects. That is to say:

  • If it's an image, return the PNG data in an <img> tag,
  • If it's an object that has a text/html representation, use that,
  • Otherwise, use the repr.

With this we lose some of the nice formatting of text lists provided by the pretty module; we could easily fix that, but we leave it as an exercise for the reader. We're also going to recurse into objects that have an HTML representation – that is to say, make it work with lists of lists.
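Outside of IPython, the recursive dispatch itself can be sketched with plain duck typing. This is a simplified stand-in for illustration, not the registered formatter machinery used below:

```python
def html_repr(obj):
    """Recurse into lists, use _repr_html_ when an object provides one,
    and fall back to the plain repr otherwise."""
    if isinstance(obj, list):
        return '<span>[' + ','.join(html_repr(o) for o in obj) + ']</span>'
    rich = getattr(obj, '_repr_html_', None)
    if rich is not None:
        return rich()
    return repr(obj)

class Box:
    def __init__(self, label):
        self.label = label
    def _repr_html_(self):
        return '<b>%s</b>' % self.label

print(html_repr([Box('a'), [Box('b'), 3]]))
# <span>[<b>a</b>,<span>[<b>b</b>,3]</span>]</span>
```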

In [13]:
def tag_from_data_II(data, size='100%'):
    return '''<img
                    style="
                        display:inline;
                        width:{1};
                        max-width:400px;
                        padding:10px;
                        margin-top:14px"
                    onMouseOver="this.style['box-shadow']='5px 5px 30px 0px rgba(163,163,163,1)'" 
                    onMouseOut="this.style['box-shadow']=''"
                    src="data:image/png;base64,{0}" 
             />'''.format(''.join(base64.encodebytes(data).decode().split('\n')), size)

def html_list_formatter(ll):
    html = get_ipython().display_formatter.formatters['text/html']
    reps = []
    for o in ll:
        if isinstance(o, Image):
            reps.append(tag_from_data_II(o.data, '200px') )
        else: 
            h = html(o)
            if h:    
                reps.append(h)
            else:
                reps.append(repr(o)+'')
    
    return '<span>['+','.join(reps)+']</span>'

Same as before, with square brackets before and after, and a bit of styling that changes the drop shadow on hover. Now we register the above with IPython:

In [14]:
ipython = get_ipython()
html = ipython.display_formatter.formatters['text/html']
html.for_type(list, html_list_formatter)
In [15]:
mylist
Out[15]:
[,,,,,,,,]

Disp

External integration for some already-existing objects is available in disp; in particular you will find representations for SparkContext and requests' Response objects (collapsible JSON content and headers), as well as a couple of others.

Magic integration

The above demonstration shows that a kernel is more than a language: it is a controlling process that manages user requests (in our case code execution) and how the results are returned to the user. There is often the assumption that a kernel is a single language; this is incorrect, as a kernel process may manage several languages and can orchestrate data movement from one language to another.

In the following we can see how a Python process makes use of what we have defined above to make SQL queries returning rich results. We also see that the execution of SQL queries has side effects in the Python namespace, showing how the kernel can orchestrate things.

In [16]:
load_ext fakesql
In [17]:
try:
    rly
except NameError:
    print('`rly` not defined')
`rly` not defined
In [18]:
%%sql
SELECT name,cover from orly WHERE color='red' LIMIT 10
Out[18]:
[['buzzwordfirst-big.png',],['buzzwordfirstdesign-big.png',],['goodenoughtoship-big.png',],['noddingalong-big.png',],['resumedrivendevelopment-big.png',],['takingonneedlessdependencies-big.png',]]
In [19]:
rly[2]
Out[19]:
['goodenoughtoship-big.png',]

It would not be hard to have modifications of the Python namespace affect the SQL database – this is left as an exercise to the reader as well (hint: use properties) – and to have integration with other languages like R, Julia, ...

Note:

This notebook was initially written to display prototype features of IPython and the Jupyter notebook, in particular completion for cell magics (for the SQL cell), and a UI element allowing to switch between the shown mimetypes. These will not be reflected in a static rendering and are not mentioned in the text, which may make for a confusing read.

Migration to Python 3 only

This is a personal experience of having migrated IPython from being single source Py2-Py3 to Python 3 only.

The migration plan

The migration of IPython to being Python 3 only started about a year ago. For the last couple of years, the IPython code base was "single source", meaning that you could run it on Python 2 and Python 3 without a single change to the source code.

We could have made the transition to a Python-3-only code base with the use of a transpiler (like 2to3, but 3to2), though there does not seem to be any commonly used tool for that. It would also have required taking care of backporting functionality, which can be a pain, and things like async-io are nearly impossible to backport cleanly to Python 2.

So we just dropped Python 2 support

The levels of Non-support

While it is easy to use the term "non-supported", there are different levels of non-support.

  • Do not release for Python 2, but you can "compile" or clone/install it yourself.
  • Officially saying "this software is not meant to run on Python 2", but it still does and is released.
  • CI Tests are run on Python 2 but "allow failure"
    • likely to break, but you accept PRs to fix things
  • CI Tests are not run on Python 2, PR fixing things are accepted
  • PR to fix things on Python 2 are not accepted
  • You are actively adding Python 3 only code
  • You are actively removing Python 2 code
  • You are actively keeping Python 2 compatibility, but make the software delete user home directory.

We settled somewhere between adding Python-3-only features and removing Python 2 code.

Making a codebase Python 3 only is "easy" in the sense that adding a single yield from is enough to make your code invalid Python 2, and no __future__ statement can fix that.
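For example, a two-line delegating generator is all it takes; there is no __future__ import that makes this parse on Python 2:

```python
def flatten(list_of_lists):
    for sublist in list_of_lists:
        yield from sublist        # SyntaxError on every Python 2 release

print(list(flatten([[1, 2], [3]])))  # [1, 2, 3]
```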

Removing code

One of the things you will probably see in the background of this section is that a static language would be of great help for this task. I would tend to say "thank you, Captain Obvious", but there is some truth to it. Python is not a static language, though, so we are trying to see how we can write Python in a better way to ease the transition.

the obvious

There are obvious functions that are present only for Python 2, in general inside if py2: blocks. These can simply be deleted, and hopefully your linter will now complain about a ton of unused variables and imports you can remove.

This is not always the case with function definitions, as most linters assume functions are exported. Coverage can help here, but then you have to make sure your function is not simply covered by tests running on Python 3.

One of the indirect effects in many places was reduced indentation. Especially at module level this leads to much greater readability, as module-level functions are easily confused with object methods when indented inside an if py2: block.

EAFP vs LBYL

It is common in Python to use try/except in place of an if/else condition. The well-known hasattr works by catching an exception, and if/else is subject to race conditions, so it's not uncommon to hear that "Easier to Ask Forgiveness than Permission" is preferred to "Look Before You Leap". That might be a good move in a codebase with requirements that will never change, but in the context of code removal it is a hassle. When encountering a try/except that is likely meant to handle a change of behavior between versions of Python, it is hard to know which version(s) of Python it was written for – some changes are between minor versions – and in which order the try/except is written (Python 2 in the try clause, or in the except clause); above all, it is nearly impossible to find these locations in the first place.

On the other hand, explicit if statements (if sys.version_info < (3,)) are easy to find – remember you only need to compare the first item of the tuple – and easy to reduce to the only needed branch. It's also way easier to apply (and find) these for minor versions.
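As a hypothetical example of such a check (the StringIO import is a classic): the explicit branch is trivially greppable and reducible, while the equivalent try/except ImportError tells you nothing about which Python it targets:

```python
import sys

# Before: an explicit, greppable version check...
if sys.version_info < (3,):
    from StringIO import StringIO  # Python 2 location (dead branch on Py3)
else:
    from io import StringIO

# ...which, once Python 2 is dropped, reduces to the single line:
# from io import StringIO

buf = StringIO()
buf.write('explicit is better than implicit')
print(buf.getvalue())
```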

The zen of Python had it right: Explicit is better than implicit.

For me at least, try/except ImportError, AttributeError is a pattern I'll avoid in favor of explicit if/else.

byte/str/string/unicode

There are a couple of locations where you might have to deal with bytes/unicode/str/string – oh boy, these names are not well chosen – in particular in areas where you are casting things that are bytes to unicode and vice versa. And I can never remember, when I read cast_bytes_py2, whether it does nothing on Python 2 or nothing on Python 3. Though once you get the hang of it, the code is soooo much shorter, simpler, and clearer in your head.

Remember: convert between bytes and unicode at the boundaries, and keep things unicode everywhere inside your programs if you want to avoid headaches. Good Python code is boring Python code.
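In code, the boundary rule looks like this (a toy sketch; read_message/send_message are made-up names, and UTF-8 at the boundary is an assumption):

```python
def read_message(raw: bytes) -> str:
    # Decode at the boundary: bytes -> unicode as soon as data enters.
    return raw.decode('utf-8')

def send_message(text: str) -> bytes:
    # Encode at the boundary: unicode -> bytes only when data leaves.
    return text.encode('utf-8')

print(read_message(b'caf\xc3\xa9'))   # café
print(send_message('café'))           # b'caf\xc3\xa9'
```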

Python 2-ism

Dealing with removing Python 2 code made me realise that there are still a lot of Python-2-isms in most of the Python 3 code I write.

inheriting classes

Writing classes that do not need to inherit from object feels weird, and I definitely don't have the habit (yet) of not doing it. Having the ability to use a bare super() is great, as I never remembered the order of the parameters.
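Both habits in one toy example: no explicit object base, and a bare super() call:

```python
class Base:
    def __init__(self, name):
        self.name = name

class Child(Base):
    def __init__(self, name):
        # Python 3: bare super(), no (Child, self) arguments to mis-order.
        super().__init__(name.title())

print(Child('binder').name)  # Binder
```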

Pathlib

IPython uses a lot of path manipulation, so we keep using os.path.join in many places, or even just the with open(...) context manager. If you can afford it and target only recent Python versions, pathlib and its Path objects are a great alternative that we tend to forget exists.
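A quick comparison, using a hypothetical config-file path:

```python
import os.path
from pathlib import Path

home = '/home/user'

# The os.path habit:
old_style = os.path.join(home, '.ipython', 'profile_default', 'ipython_config.py')

# The pathlib alternative: `/` composes paths, and the object carries
# convenient attributes like .name and .suffix.
new_style = Path(home) / '.ipython' / 'profile_default' / 'ipython_config.py'

print(new_style.name)    # ipython_config.py
print(new_style.suffix)  # .py
```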

decode

Most decode/encode operations do the right thing, and there is almost no need to specify the encoding anywhere. This makes handling bytes -> str conversions even easier.

Python 3 ism

These are the features of Python 3 which have no equivalent in Python 2 and would make a great addition to many code bases. I tend to forget they exist and do not design code around them enough.

async/await

I'm just scratching the surface of async/await, and I definitely see great opportunities here. You need to design code to work in an async fashion, but it should be relatively straightforward to use async code from synchronous code. I should learn more about sans-io (Google is your friend) to make code reusable.

type annotations

Type annotations are an incredible feature that, even just as visual annotation, replaces numpydoc. I have a small grudge against the PEP that describes the position of the spaces, but even without mypy the ability to annotate types is a huge boon for documentation. Now docstrings can focus on the why/how of functions.
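For instance, the types in this hypothetical signature carry what a numpydoc Parameters section used to spell out, leaving the docstring free for intent:

```python
from typing import List, Optional

def tokenize(line: str, seps: Optional[List[str]] = None) -> List[str]:
    """Split `line` on each separator in `seps` (whitespace by default)."""
    if seps is None:
        return line.split()
    for sep in seps:
        line = line.replace(sep, ' ')
    return line.split()

print(tokenize('a,b;c', seps=[',', ';']))  # ['a', 'b', 'c']
```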

kwarg only

Keyword-only arguments are a great, often under-appreciated feature of Python 3. The *-syntax is IMHO a bit clunky – but I don't have a better option. It gives you great flexibility in APIs without sacrificing backward compatibility. I wish I had positional-only arguments as well.
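A sketch of the *-syntax (open_repo is a made-up function): everything after the bare * must be passed by keyword, so new options can be added later without breaking positional callers.

```python
def open_repo(org, repo, *, branch='master'):
    # `branch` can only be passed by keyword.
    return 'https://github.com/%s/%s/tree/%s' % (org, repo, branch)

print(open_repo('jupyterhub', 'binderhub', branch='master'))

try:
    open_repo('jupyterhub', 'binderhub', 'master')  # positional: rejected
except TypeError as e:
    print('TypeError:', e)
```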

Writing an async REPL - Part 1

This is the first part in a series of blog posts explaining how I implemented the ability to await code at the top-level scope in the IPython REPL. Don't expect the second part soon, and don't bother me for it. I know I should write it, but time is a rare luxury.

It is an interesting adventure into how Python code gets executed. I must admit it changed quite a bit how I understand Python code nowadays, and made me even more excited about async/await in Python.

It should also dive quite a bit into the internals of Python/CPython, if you are ever interested in what some of these things are.

In [1]:
# we cheat and deactivate the new IPython feature to match Python repl behavior
%autoawait False

Async or not async, that is the question

You might not have noticed it, but since Python 3.5 the following is valid Python syntax:

In [2]:
async def a_function():
    async with contextmanager() as f:
        result = await f.get('stuff')
        return result

So you've been curious and read a lot about asyncio; you may have come across a few new libraries like aiohttp and all the aio-libs, heard about sans-io, read complaints about the different approaches one could take, and how we could maybe even do better. You vaguely understand the concept of loops and futures, but the term coroutine is still unclear. So you decide to poke around yourself in the REPL.

In [3]:
import aiohttp
In [4]:
print(aiohttp.__version__)
coro_req = aiohttp.get('https://api.github.com')
coro_req
1.3.5
Out[4]:
<aiohttp.client._DetachedRequestContextManager at 0x1045289d8>
In [5]:
import asyncio
res = asyncio.get_event_loop().run_until_complete(coro_req)
In [6]:
res
Out[6]:
<ClientResponse(https://api.github.com) [200 OK]>
<CIMultiDictProxy('Server': 'GitHub.com', 'Date': 'Thu, 06 Apr 2017 19:49:20 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Status': '200 OK', 'X-Ratelimit-Limit': '60', 'X-Ratelimit-Remaining': '50', 'X-Ratelimit-Reset': '1491508909', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept', 'Etag': 'W/"7dc470913f1fe9bb6c7355b50a0737bc"', 'X-Github-Media-Type': 'github.v3; format=json', 'Access-Control-Expose-Headers': 'ETag, Link, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval', 'Access-Control-Allow-Origin': '*', 'Content-Security-Policy': "default-src 'none'", 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'deny', 'X-Xss-Protection': '1; mode=block', 'Vary': 'Accept-Encoding', 'X-Served-By': 'a51acaae89a7607fd7ee967627be18e4', 'Content-Encoding': 'gzip', 'X-Github-Request-Id': '8182:3911:C50FFE:EF0636:58E69BC0')>
In [7]:
res.json()
Out[7]:
<generator object ClientResponse.json at 0x1052cd9e8>
In [8]:
json = asyncio.get_event_loop().run_until_complete(res.json())
json
Out[8]:
{'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'current_user_repositories_url': 'https://api.github.com/user/repos{?type,page,per_page,sort}',
 'current_user_url': 'https://api.github.com/user',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': 'https://api.github.com/user/keys',
 'notifications_url': 'https://api.github.com/notifications',
 'organization_repositories_url': 'https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}',
 'organization_url': 'https://api.github.com/orgs/{org}',
 'public_gists_url': 'https://api.github.com/gists/public',
 'rate_limit_url': 'https://api.github.com/rate_limit',
 'repository_search_url': 'https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}',
 'repository_url': 'https://api.github.com/repos/{owner}/{repo}',
 'starred_gists_url': 'https://api.github.com/gists/starred',
 'starred_url': 'https://api.github.com/user/starred{/owner}{/repo}',
 'team_url': 'https://api.github.com/teams',
 'user_organizations_url': 'https://api.github.com/user/orgs',
 'user_repositories_url': 'https://api.github.com/users/{user}/repos{?type,page,per_page,sort}',
 'user_search_url': 'https://api.github.com/search/users?q={query}{&page,per_page,sort,order}',
 'user_url': 'https://api.github.com/users/{user}'}

It's a bit painful to pass everything to run_until_complete, but you know how to write async-def functions and pass them to an event loop:

In [9]:
loop = asyncio.get_event_loop()
run = loop.run_until_complete
url = 'https://api.github.com/rate_limit'

async def get_json(url):
    res = await aiohttp.get(url)
    return await res.json()

run(get_json(url))
Out[9]:
{'rate': {'limit': 60, 'remaining': 50, 'reset': 1491508909},
 'resources': {'core': {'limit': 60, 'remaining': 50, 'reset': 1491508909},
  'graphql': {'limit': 0, 'remaining': 0, 'reset': 1491511760},
  'search': {'limit': 10, 'remaining': 10, 'reset': 1491508220}}}

Good! And then you wonder: why do I have to wrap things in a function? If I have a default loop, isn't it obvious where I want to run my code? Can't I await things directly? So you try:

In [10]:
await aiohttp.get(url)
  File "<ipython-input-10-055eb13ed07d>", line 1
    await aiohttp.get(url)
                ^
SyntaxError: invalid syntax

What? Oh that's right, there is no way in Python to set a default loop... but a SyntaxError? Well, that's annoying.

Outsmart Python

Fortunately you (in this case me) are in control of the REPL, and you can bend it to your will. Surely you can do something. First you try to remember how a REPL works:

In [11]:
mycode = """
a = 1
print('hey')
"""
def fake_repl(code):
    import ast
    module_ast = ast.parse(mycode)
    bytecode = compile(module_ast, '<fakefilename>', 'exec')
    global_ns = {}
    local_ns = {}
    exec(bytecode, global_ns, local_ns)
    return local_ns

fake_repl(mycode)
hey
Out[11]:
{'a': 1}

We don't show global_ns as it is huge – it contains everything that's available by default in Python. Let's see where it fails if you try a top-level async statement:

In [12]:
import ast
mycode = """
import aiohttp
await aiohttp.get('https://aip.github.com/')
"""

module_ast = ast.parse(mycode)
  File "<unknown>", line 3
    await aiohttp.get('https://aip.github.com/')
                ^
SyntaxError: invalid syntax

Ouch, so we can't even compile it. Let's be smart: can we get at the inner code if we wrap it in an async def?

In [13]:
mycode = """
async def fake():
    import aiohttp
    await aiohttp.get('https://aip.github.com/')
"""
module_ast = ast.parse(mycode)
ast.dump(module_ast)
Out[13]:
"Module(body=[AsyncFunctionDef(name='fake', args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Import(names=[alias(name='aiohttp', asname=None)]), Expr(value=Await(value=Call(func=Attribute(value=Name(id='aiohttp', ctx=Load()), attr='get', ctx=Load()), args=[Str(s='https://aip.github.com/')], keywords=[])))], decorator_list=[], returns=None)])"
In [14]:
ast.dump(module_ast.body[0])
Out[14]:
"AsyncFunctionDef(name='fake', args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Import(names=[alias(name='aiohttp', asname=None)]), Expr(value=Await(value=Call(func=Attribute(value=Name(id='aiohttp', ctx=Load()), attr='get', ctx=Load()), args=[Str(s='https://aip.github.com/')], keywords=[])))], decorator_list=[], returns=None)"

As a reminder, AST stands for Abstract Syntax Tree; you may construct an AST which is not a valid Python program, like an if-else-else. AST trees can be modified. What we are interested in is the body of the function, which itself is the first object of a dummy module:

In [15]:
body = module_ast.body[0].body
body
Out[15]:
[<_ast.Import at 0x105d503c8>, <_ast.Expr at 0x105d50438>]

Let's pull out the body of the function and put it at the top level of a newly created module:

In [16]:
async_mod = ast.Module(body)
ast.dump(async_mod)
Out[16]:
"Module(body=[Import(names=[alias(name='aiohttp', asname=None)]), Expr(value=Await(value=Call(func=Attribute(value=Name(id='aiohttp', ctx=Load()), attr='get', ctx=Load()), args=[Str(s='https://aip.github.com/')], keywords=[])))])"

Mouahahahahahahahahah, you managed to get a valid top-level async ast ! Victory is yours !

In [17]:
bytecode = compile(async_mod, '<fakefile>', 'exec')
  File "<fakefile>", line 4
SyntaxError: 'await' outside function

Grumlgrumlgruml. You haven't said your last word. You're going to take your revenge later. Let's see what we can do in Part II, not written yet.
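As a forward-looking note (this relies on Python 3.8+, newer than what is used in this post): CPython later grew a compile flag that solves exactly this problem, so the AST surgery above is no longer needed to compile top-level await. A minimal sketch:

```python
import ast
import inspect

# Assumes Python 3.8+, where the PyCF_ALLOW_TOP_LEVEL_AWAIT compile flag
# was added; it lets `compile` accept `await` outside a function.
source = "import asyncio\nawait asyncio.sleep(0)\n"
code = compile(source, "<fakefile>", "exec", flags=ast.PyCF_ALLOW_TOP_LEVEL_AWAIT)

# The resulting code object is marked as a coroutine, so it can be
# driven by an event loop instead of being exec'ed directly.
print(bool(code.co_flags & inspect.CO_COROUTINE))  # True
```

This is roughly the mechanism modern IPython uses for its top-level-await support.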

Changing ByteStr REPR

A rebuttal against Python 3 was recently written by the (in)famous Zed Shaw, with many responses to various arguments and counter-arguments.

One particular topic which caught my eye was the bytearray vs unicodearray debate. I'll try to explicitly avoid the terms str/string/bytes/unicode as the naming is (IMHO) confusing, but that's a debate for another time. If one pays attention to the above debates, you might see that there are roughly two camps:

  • bytearray and unicodearray are two different things, and we should never convert from one to the other. (that's roughly the Pro-Python-3 camp)
  • bytearray and unicodearray are similar enough in most cases that we should do the magic for users.

I'm greatly exaggerating here, and the following argues for neither one side nor the other. I have my personal preference about what I think is good, but that's irrelevant for now. Note that both sides argue that their preference is better for beginners.

You can often find posts trying to explain the string/str/bytes misconception, like this one, which keeps insisting on the fact that str in Python 3 is far different from bytes.

The mistake in the REPR

I have a theory that the bytes/str issue is not in their behavior, but in their REPR. The REPR is, in the end, the main information channel between the object and the brain of the programmer. Also, Python is "ducktyped", and you have to admit that bytes and str kinda look similar when printed, so assuming they should behave in similar ways is not far-fetched. I'm not saying that users will consciously assume bytes/str are the same; I'm saying that the human brain inherently may make such an association.

From the top of your head, what does requests.get(url).content return ?

In [1]:
import requests_cache
import requests
requests_cache.install_cache('cachedb.tmp')
In [2]:
requests.get('http://swapi.co/api/people/1').content
Out[2]:
b'{"name":"Luke Skywalker","height":"172","mass":"77","hair_color":"blond","skin_color":"fair","eye_color":"blue","birth_year":"19BBY","gender":"male","homeworld":"http://swapi.co/api/planets/1/","films":["http://swapi.co/api/films/6/","http://swapi.co/api/films/3/","http://swapi.co/api/films/2/","http://swapi.co/api/films/1/","http://swapi.co/api/films/7/"],"species":["http://swapi.co/api/species/1/"],"vehicles":["http://swapi.co/api/vehicles/14/","http://swapi.co/api/vehicles/30/"],"starships":["http://swapi.co/api/starships/12/","http://swapi.co/api/starships/22/"],"created":"2014-12-09T13:50:51.644000Z","edited":"2014-12-20T21:17:56.891000Z","url":"http://swapi.co/api/people/1/"}'

... bytes...

I'm pretty sure you glanced ahead in this post and probably thought it was "Text", probably even, in this case, JSON. It might be invalid JSON; I'm pretty sure you cannot tell.

Why does it return bytes ? Because it could fetch an image:

In [3]:
requests.get('https://avatars0.githubusercontent.com/u/335567').content[:200]
Out[3]:
b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xcc\x00\x00\x01\xcc\x08\x06\x00\x00\x00X\xdb\x98\x86\x00\x00 \x00IDATx\xda\xac\xbdy\x93\x1b\xb9\xb2\xf6\xf7K\x00\xb5\x90\xbdH\xa3\x99\xb9s7\xbf\xf1:\x1c\x0e/\xdf\xff\xdb8\xec\xb0}\xd79g4Rw\xb3IV\x15\x80\xf4\x1f@\xedUl\xea\\w\x84\xa65-6Y\x85\x02ry\xf2\xc9'\xa5\xfe\x9f\xfeGE\x04#\x821\x061\x16c\x0c\xc6XD\x0c\x02\xa0\x8a\x8a\x801\xa4\x1f\x08\x880\xfdRUD\x04\xd5\xfe\xff#6z\x8c*\xaa\x82\x88\xe0C \x84@\xf7~\xa6yy\xc5=>Q>~\xe6\xe1\xf3g~\xfd\xa7\x7f\xc28\x07\xb6\x00\x84h-\x88A1(\xe0U\xd2\xfb\xb8t\r1("

And if you decode the first request ?

In [4]:
requests.get('http://swapi.co/api/people/2').content.decode()
Out[4]:
'{"name":"C-3PO","height":"167","mass":"75","hair_color":"n/a","skin_color":"gold","eye_color":"yellow","birth_year":"112BBY","gender":"n/a","homeworld":"http://swapi.co/api/planets/1/","films":["http://swapi.co/api/films/5/","http://swapi.co/api/films/4/","http://swapi.co/api/films/6/","http://swapi.co/api/films/3/","http://swapi.co/api/films/2/","http://swapi.co/api/films/1/"],"species":["http://swapi.co/api/species/2/"],"vehicles":[],"starships":[],"created":"2014-12-10T15:10:51.357000Z","edited":"2014-12-20T21:17:50.309000Z","url":"http://swapi.co/api/people/2/"}'

Well, that looks the same (except the leading b...). Go explain to a beginner that the two above are totally different things, while they already struggle with 0-based indexing, iterators, and the syntax of the language.
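And they really are different things. One concrete behavioral difference, on a JSON-looking payload like the ones above:

```python
# Indexing bytes yields integers, while indexing str yields
# one-character strings.
b = b'{"name":"C-3PO"}'
s = '{"name":"C-3PO"}'

print(b[0])   # 123 -- the integer byte value of '{'
print(s[0])   # prints { -- a one-character str
print(b[:1])  # b'{' -- slicing, by contrast, preserves the type
```

A beginner who mentally equates the two reprs has no way to predict this from the way they print.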

Changing the repr

Let's rework the repr of bytes to better represent what they are. IPython allows changing an object's repr easily:

In [5]:
text_formatter = get_ipython().display_formatter.formatters['text/plain']
In [6]:
def _print_bytestr(arg, p, cycle):
    p.text('<BytesBytesBytesBytesBytes>')        
text_formatter.for_type(bytes, _print_bytestr)
Out[6]:
<function IPython.lib.pretty._repr_pprint>
In [7]:
requests.get('http://swapi.co/api/people/4').content
Out[7]:
<BytesBytesBytesBytesBytes>

Make a useful repr

<BytesBytesBytesBytesBytes> may not be a useful repr, so let's try to make a repr that:

  • Conveys that bytes are, in general, not text.
  • Lets us peek into the content to guess what it is.
  • Pushes the user to .decode() if necessary.

Generally in Python, objects have a repr which starts with <, then has the class name, a quoted representation of the object, the memory location of the object, and a closing >.

As the quoted representation of the object may be really long, we can elide it.

A common representation of bytes could be binary, but it's not really compact. Hex is compact but more difficult to read, and makes peeking at the content hard when it could be ASCII. So let's go with an ASCII representation where we escape non-ASCII characters.

In [8]:
ellide = lambda s: s if (len(s) < 75) else  s[0:50]+'...'+s[-16:]
In [9]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytes '+ellide(repr(arg))+' at {}>'.format(hex(id(arg))))       
text_formatter.for_type(bytes, _print_bytestr)
Out[9]:
<function __main__._print_bytestr>
In [10]:
requests.get('http://swapi.co/api/people/12').content
Out[10]:
<bytes b'{"name":"Wilhuff Tarkin","height":"180","mass":"...pi/people/12/"}' at 0x107299228>
In [11]:
requests.get('http://swapi.co/api/people/12').content.decode()
Out[11]:
'{"name":"Wilhuff Tarkin","height":"180","mass":"unknown","hair_color":"auburn, grey","skin_color":"fair","eye_color":"blue","birth_year":"64BBY","gender":"male","homeworld":"http://swapi.co/api/planets/21/","films":["http://swapi.co/api/films/1/","http://swapi.co/api/films/6/"],"species":["http://swapi.co/api/species/1/"],"vehicles":[],"starships":[],"created":"2014-12-10T16:26:56.138000Z","edited":"2014-12-20T21:17:50.330000Z","url":"http://swapi.co/api/people/12/"}'

Advantage: It is not gobbledygook anymore when getting binary resources !

In [12]:
requests.get('https://avatars0.githubusercontent.com/u/335567').content
Out[12]:
<bytes b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\...0IEND\xaeB`\x82' at 0x107e0c000>

Remapping notebook shortcuts

As the Jupyter notebook runs in a browser for technical and practical reasons, we only have a limited number of shortcuts available and choices need to be made. Often these choices may conflict with browser shortcuts, and you might need to remap them.

Today I was informed by Stefan van der Walt that Cmd-Shift-P conflicts in Firefox. It is mapped both to opening the command palette in the notebook and to opening a new Private Browsing window.

Using Private Browsing windows is extremely useful. When developing a website you might want to look at it without being logged in, and with an empty cache. So let's see how we can remap the Jupyter notebook shortcut.

TL; DR;

Use the following in your ~/.jupyter/custom/custom.js :

require(['base/js/namespace'], function(Jupyter){
  // we might want to put that in a callback, or wait for
  // an event telling us the notebook is ready.
  console.log('== remapping command palette shortcut ==')
  // note that meta is the command key on mac.
  var source_sht = 'meta-shift-p'
  var target_sht = 'meta-/'
  var cmd_shortcuts = Jupyter.keyboard_manager.command_shortcuts;
  var action_name = cmd_shortcuts.get_shortcut(source_sht)
  cmd_shortcuts.add_shortcut(target_sht, action_name)
  cmd_shortcuts.remove_shortcut(source_sht)
  console.log('== ', action_name, 'remapped from', source_sht, 'to', target_sht )
})

details

We need to use require and register a callback once the notebook is loaded:

require(['base/js/namespace'], function(Jupyter){
  ...
})

Here we grab the main namespace and name it Jupyter.

Then get the object that hold the various shortcuts: var cmd_shortcuts = Jupyter.keyboard_manager.command_shortcuts.

Shortcuts are defined by sequences of keys with modifiers. Modifiers are dash-separated (they need to be pressed at the same time); sequences are comma-separated. For example, quitting in vim would be esc,;,w,q, and in emacs ctrl-x,ctrl-c.

Here we want to unbind meta-shift-p (p is lowercase despite shift being pressed) and bind meta-/ (The shortcut Stefan wants). Note that meta- is the command key on mac.

We need to get the current command bound to this shortcut (cmd_shortcuts.get_shortcut(source_sht)). You could hardcode the name of the command but it may change a bit depending on notebook version (this is not yet public API). Here it is jupyter-notebook:show-command-palette.

You now bind it to your new shortcut:

cmd_shortcuts.add_shortcut('meta-/', action_name)

And finally unbind the original one

cmd_shortcuts.remove_shortcut('meta-shift-p')

The UI reflects your changes !

If you open the command palette, you should see that the Show command palette command now display Command-/ as its shortcut !

Future

We are working on an interface to edit shortcuts directly from within the UI, so you won't have to write a single line of code !

Questions, feedback and fixes welcomed

Viridisify


As usual this is available and has been written as a Jupyter notebook; if you'd like to play with the code, feel free to fork it.


The jet colormap (AKA "rainbow") is ubiquitous, but there is a lot of controversy as to whether it is a good choice (it is far from the best one), and better options have been designed.

The question is, if you have a graph that use a specific colormap, and you would prefer for it to use another one; what do you do ?

Well, if you have the underlying data that's easy, but that's not always the case.

So how do we remap a plot which has a non perceptually uniform colormap to use another one ? What happens if there are encoding artifacts and my pixel colors are slightly off ?

I came up with a prototype a few months ago, and was asked recently by @stefanv to "correct" an animated plot of hurricane Matthew, where the "jet" colormap seems to provide an illusion of growth:

https://twitter.com/stefanvdwalt/status/784429257556492288

Let's see how we can convert a "Jet" image to a viridis based one. We'll first need some assumptions:

  • This assumes that you "know" the initial colormap of the plot, and that the encoding/compression process will not change the colors "too much".
  • There are pixels in the image which are not part of the colormap (typically text, axes, cat pictures...).

We will try to remap all the pixels that fall not "too far" from the initial colormap to the new colormap.

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
In [2]:
import matplotlib.colors as colors
In [3]:
!rm *.png *.gif out*
rm: output.gif: No such file or directory

I used the following to convert from mp4 to an image sequence (8 fps, determined manually), then a sequence of images back to video, and video to gif (the quality is better than converting to gif directly):

$ ffmpeg -i INPUT.mp4 -r 8 -f image2 img%02d.png
$ ffmpeg -framerate 8 -i vir-img%02d.png -c:v libx264 -r 8 -pix_fmt yuv420p out.mp4
$ ffmpeg -i out.mp4  output.gif
In [4]:
%%bash
ffmpeg -i input.mp4 -r 8 -f image2 img%02d.png -loglevel panic

Let's take our image without the alpha channel, so only the first 3 components:

In [5]:
import matplotlib.image as mpimg
img = mpimg.imread('img01.png')[:,:,:3]
In [6]:
fig, ax = plt.subplots()
ax.imshow(img)
fig.set_figheight(10)
fig.set_figwidth(10)

As you can see it does use "Jet" (most likely).

Let's look at the distribution of pixels in RGB space...

In [7]:
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

import matplotlib.pyplot as plt
In [8]:
def rep(im, cin=None, sub=128):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    pp = im.reshape((-1,3)).T[:,::300]
    
    if cin:
        cmapin = plt.get_cmap(cin)
        cmap256 = colors.makeMappingArray(sub, cmapin)[:, :3].T
        ax.scatter(cmap256[0], cmap256[1], cmap256[2], marker='.', label='colormap', c=range(sub), cmap=cin, edgecolor=None)
    
    ax.scatter(pp[0], pp[1], pp[2], c=pp.T, marker='+')
    
    ax.set_xlabel('R')
    ax.set_ylabel('G')
    ax.set_zlabel('B')
    ax.set_title('Color of pixels')
    if cin:
        ax.legend()
    return ax
    
ax = rep(img)

We can see specific clusters of pixels. Let's plot the location of our "Jet" colormap and a diagonal of "gray". We can guess that the various compression artifacts have jittered the pixels slightly away from their original locations.

Let's look at where the jet colormap is supposed to fall:

In [9]:
rep(img, 'jet')
Out[9]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x111c9cc88>

Ok, that's pretty accurate; we also see that our selected graph does not use the full extent of jet.

In order to find all the pixels that use "Jet" efficiently, we will use scipy.spatial.cKDTree in the colorspace. In particular we will subsample the initial colormap into sub=256 samples, collect only the pixels that are within d=0.2 of a sample, and map each of these pixels to the closest sample.

As we know the subsampling of the initial colormap, we can also determine the output colors.

The pixels that are "too far" from the pixels of the colormap are kept unchanged.

Increasing sub to a higher value will give a smoother final colormap.
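The "too far" test relies on a detail of the cKDTree API: when a query point has no neighbor within distance_upper_bound, the returned index is the sentinel value len(data). A toy illustration on made-up 2D points (not the image data):

```python
import numpy as np
from scipy.spatial import cKDTree

# Two reference "colormap" points in a toy 2D space.
data = np.array([[0.0, 0.0], [1.0, 1.0]])
tree = cKDTree(data)

# Query one nearby point and one far-away point with a distance bound.
dist, idx = tree.query(np.array([[0.1, 0.0], [5.0, 5.0]]),
                       distance_upper_bound=0.5)

print(idx[0])               # 0: matched to the first reference point
print(idx[1] == len(data))  # True: the "no neighbor in range" sentinel
```

This sentinel is exactly what the mask below (indices == l) uses to leave non-colormap pixels untouched.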

In [10]:
from scipy.spatial import cKDTree
In [11]:
def convert(sub=256, d=0.2, cin='jet', cout='viridis', img=img, show=True):
    viridis = plt.get_cmap(cout)
    cmapin = plt.get_cmap(cin)
    cmap256 = colors.makeMappingArray(sub, cmapin)[:, :3]
    original_shape = img.shape
    img_data = img.reshape((-1,3))
    
    # this will efficiently find the pixels "close" to jet
    # and assign them to which point (from 1 to 256) they are on the colormap.
    K = cKDTree(cmap256)
    res = K.query(img_data, distance_upper_bound=d)
    
    indices = res[1]
    l = len(cmap256)
    indices = indices.reshape(original_shape[:2])
    remapped = indices

    indices.max()

    mask = (indices == l)

    remapped = remapped / (l-1)
    mask = np.stack( [mask]*3, axis=-1)

    # here we add only these pixel and plot them again with viridis.
    blend = np.where(mask, img, viridis(remapped)[:,:,:3])
    if show:
        fig, ax = plt.subplots()
        fig.set_figheight(10)
        fig.set_figwidth(10)
        ax.imshow(blend)
    return blend
In [12]:
res = convert(img=img)
rep(res)
Out[12]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x113791278>

Let's look at what happens if we decrease our leniency on the "proximity" of each pixel to the jet colormap:

In [13]:
rep(convert(img=img, d=0.05))
Out[13]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x1159fd6d8>

Ouch, we definitely missed some pixels.

In [14]:
rep(convert(img=img, sub=8, d=0.4))
Out[14]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x10d5faef0>

Subsampling to 8 colors (see above) forces us to increase the distance at which we accept points, and hints at the non-linearity of "Jet" as seen in the colorbar.

In [15]:
rep(convert(img=img, sub=256, d=0.7))
Out[15]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x115a464a8>

Being too lenient on the distance between the colormap and the pixels will change the color of undesired parts of our image.

Also look at how our clean image does not scatter our pixels in RGB space !

Ok, we've played enough, let's convert all our images and re-make a gif/mp4 out of it...

In [16]:
tpl = 'img%02d.png'
tplv = 'vir-img%02d.png'
for i in range(1,18):
    img = mpimg.imread(tpl%i)[:,:,:3]
    vimg = convert(show=False, img=img)
    mpimg.imsave(tplv %i, vimg)
In [17]:
%%bash
ffmpeg -framerate 8 -i vir-img%02d.png -c:v libx264 -r 8 -pix_fmt yuv420p out.mp4 -y -loglevel panic
In [18]:
%%bash
ffmpeg -i out.mp4  output.gif -y -loglevel panic

Enjoy the result, and see how the northern part of the hurricane no longer looks like it gets that much of an increase in intensity !

https://twitter.com/Mbussonn/status/784447098963972098

That's not a reason not to stay safe from hurricanes.


Thanks to Stefan van der Walt and Nathaniel Smith for inspiration and helpful discussion.


Notes:

Michael Ayes asks:

that’s cool, but technically, that’s viridis_r, isn’t it?

That's debatable: I do a map from Jet to Viridis, and the authors of the initial graph seem to have used jet_r (reversed jet), so the final version looks like viridis_r (reversed viridis). More generally, if the original graph had used f(jet), then the final version would be close to f(viridis).

As usual, please feel free to ask questions or send me updates, as my English is likely far from perfect.

Cross Language Integration

Jupyter and multiple Languages

Note: This has been written as a notebook so you can download it to run it yourself using jupyter or nteract, view it on GitHub, or on nbviewer if you prefer the classical notebook rendering. It originally should have appeared on my blog.

I would be happy to know if you can get it to work on binder.


[UPDATE]

Thanks to Michael Pacer for copy-editing this. Also note that this notebook is based around multiple examples that have been written by various people across many years.


An often requested feature for the Jupyter Notebook is the ability to have multiple kernels, often in many languages, for a single notebook.

While the request in spirit is a perfectly valid one, it often rests on a misunderstanding of what having a single kernel means. In particular, having multiple languages is often easier if you have a single process which handles the dispatching of instructions to the potentially multiple underlying languages. It is possible to do that in a single kernel which orchestrates dispatching instructions and moving data around.

Whether the multiple languages that get orchestrated together are remote processes, or simply library calls or more complex mechanisms becomes an implementation detail.

Python is known to be a good "glue" language, and over the years the IPython kernel has seen a growing number of extensions showing that dynamic cross-language integration can be seamless from the point of view of the user.

In the following we only scratch the surface of what is possible across a variety of languages. The approach shown here is one among many. The Calysto organisation for example has several projects taking different approaches on the problem.

In the following I will show a quick overview on how you can in single notebook interact with many languages, via subprocess call (Bash, Ruby), Common Foreign function interface (C, Rust, Fortran, ...), or even crazier approaches (Julia).

IPython and cross language integration

The rest of this is mostly a demo of how cross-language integration works in a Jupyter notebook, using the features of the reference IPython kernel implementation. These features are completely handled in the kernel, so they would need to be reimplemented on a per-kernel basis; on the other hand, they also work in pure terminal IPython, nbconvert, or any other programmatic use of IPython.

Most of what you will see here are just thin wrappers around already existing libraries. These libraries (and their respective authors) do all the heavy lifting; I just show how seamless a cross-language environment can be from the user's point of view. The installation of these libraries might not be easy either, and getting all these languages to play together can be a complex task. It is, though, becoming easier and easier.

The term "thin" does not imply that the wrappers are simple, or easy to write; it indicates that the wrappers are far from being complete. What is shown here is completely doable using standard Python syntax and a bit of manual work. So what you'll see here is mostly convenience.

The good old example of Fibonacci

Understanding the various languages themselves is not necessary; most of the code here should be self-explanatory and straightforward. We'll define several functions that compute the nth Fibonacci number more or less efficiently, either using the classic recursive implementation, or sometimes using an unrolled optimized version. As a reminder, the Fibonacci sequence is defined as follows:

$$ F_n = \begin{cases} 1 &\mbox{if } n \leq 2 \\ F_{n-1}+F_{n-2} & \mbox{otherwise }\end{cases}$$

The fact that we calculate the Fibonacci sequence has little importance, except that the value of $F_n$ grows really fast, in $O(\phi^n)$ with $\phi \approx 1.618$, and the recursive implementation will have a hard time getting beyond n=100 as the number of calls grows just as fast. Be careful especially if you calculate $F_{F_n}$ or deeper compositions. Remembering that n=5 is stable via $F$ might be useful.

Here are the first terms of the Fibonacci sequence:

  1. 1
  2. 1
  3. 1+1 = 2
  4. 2+1 = 3
  5. 3+2 = 5
  6. 5+3 = 8
  7. 8+5 = 13 ...
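A quick sanity check of the terms and the growth claim above, using an unrolled implementation similar to the one defined later in this post (the `fib` here is just for checking):

```python
from math import sqrt

def fib(n):
    # unrolled Fibonacci with the indexing above (F_1 = F_2 = 1)
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

# the first terms listed above
print([fib(i) for i in range(1, 8)])  # [1, 1, 2, 3, 5, 8, 13]

# growth check: F_n is the integer closest to phi**n / sqrt(5)
phi = (1 + sqrt(5)) / 2
print(fib(10) == round(phi**10 / sqrt(5)))  # True
```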

Basic Python cross-language integration

In [1]:
import sys
sys.version_info
Out[1]:
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)

Python offers many facilities to call into other languages, whether we "shell out" or use the C-API. The easiest for the end user is often subprocess, with the recently added run function. Though it might be annoying to define all foreign code in strings and call subprocess.run manually.

This is why in the following you will see 2 constructs:

optional_python_lhs = %language (language RHS expression)

As well as

%%language --cli like arguments
A block:
  containing expressions and statement
from another:
  language

IPython defines such constructs, called magics: line magics start with a single percent (%something) and act only on the rest of the line; cell magics start with %% and act on the whole cell.

One example of line magic is %timeit, which runs the following statement multiple time to get statistics about runtime:

In [2]:
%timeit [x for x in range(1000)]
10000 loops, best of 3: 34.9 µs per loop

Streamlining calling subprocess

The IPython team has special-cased a couple of these for foreign languages, here Ruby. We define the fibonacci function, compute the 15th value, print it to standard out from Ruby, and capture the output in the Python variable fib15.

In [3]:
%%ruby --out fib15
def fibonacci( n )
  return  n  if ( 0..1 ).include? n
  ( fibonacci( n - 1 ) + fibonacci( n - 2 ) )
end
puts fibonacci( 15 )
In [4]:
fib15
Out[4]:
'610\n'

Now from within Python we can do a crude parsing of the previous string output, and get the value of Fibonacci of 15.

In [5]:
int(fib15.strip())
Out[5]:
610

Ok, that's somewhat useful, but not really that much. It's convenient for self-contained code. You cannot pass variables in... or can't you ?

Send variables in

Calling subprocess can be quite cumbersome when working interactively; we saw above that %% cell magics can help, but you might want to shell out in the middle of a Python function. Let's create a bunch of random file names and fake some subprocess operations with them.

In [6]:
import random
import string

def rand_names(k=10,l=10):
    for i in range(k):
        yield '_' + ''.join(random.choice(string.ascii_letters) for i in range(l))+'.o'

The !something expression is – for the purpose of this demo – equivalent to %sh something $variable, where $variable is looked up in locals() and replaced by its __repr__.

In [7]:
for f in rand_names():
    print('creating file',f)
    !touch $f
creating file _XiKZxtsLwX.o
creating file _DKCjzvlTuF.o
creating file _ShTGoJMxCp.o
creating file _nbockcrTbT.o
creating file _cZnVpuYsxJ.o
creating file _UYxnHlwJwy.o
creating file _fxoVMPQbJV.o
creating file _wJJUbrPzpq.o
creating file _ngMMBxaDkG.o
creating file _uKbssAzHBP.o
In [8]:
ls -1 *.o
_DKCjzvlTuF.o
_ShTGoJMxCp.o
_UYxnHlwJwy.o
_XiKZxtsLwX.o
_cZnVpuYsxJ.o
_fxoVMPQbJV.o
_nbockcrTbT.o
_ngMMBxaDkG.o
_uKbssAzHBP.o
_wJJUbrPzpq.o

We can as well get values back using the ! syntax

In [9]:
files = !ls *.o
files
Out[9]:
['_DKCjzvlTuF.o',
 '_ShTGoJMxCp.o',
 '_UYxnHlwJwy.o',
 '_XiKZxtsLwX.o',
 '_cZnVpuYsxJ.o',
 '_fxoVMPQbJV.o',
 '_nbockcrTbT.o',
 '_ngMMBxaDkG.o',
 '_uKbssAzHBP.o',
 '_wJJUbrPzpq.o']
In [10]:
!rm -rf  *.{o,c,so} Cargo.* src target

(Who said I was going to use rust-lang.org later ?)

In [11]:
ls *.o
ls: *.o: No such file or directory

Ok, our directory is clean !

Add some state

Ok, that was kind of cute: fire up a subprocess, serialize, pipe data in as a string, pipe data out as a string, kill the subprocess... What about something less stateless, or more stateful?

Let's define the fibonacci function in python:

In [12]:
def fib(n):
    """
    A simple definition of fibonacci manually unrolled
    """
    if n<2:
        return 1
    x,y = 1,1
    for i in range(n-2):
        x,y = y,x+y
    return y
In [13]:
[fib(i) for i in range(1,10)]
Out[13]:
[1, 1, 2, 3, 5, 8, 13, 21, 34]

Let's store the values for n from 1 to 30 in Y, and graph them.

In [14]:
%matplotlib inline
import numpy as np
X = np.arange(1,30)
Y = np.array([fib(x) for x in X])
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(X, Y)
ax.set_xlabel('n')
ax.set_ylabel('fib(n)')
ax.set_title('The Fibonacci sequence grows fast !')
Out[14]:
<matplotlib.text.Text at 0x10caff470>

It may not surprise you, but this looks like an exponential, so if we were to look at $\log(fib(n))$ versus $n$ it would look approximately like a line. We can try to do a linear regression using this model. R is a language many people use to do statistics, so let's use R.

Let's enable integration between Python and R using the RPy2 python package developed by Laurent Gautier and the rest of the rpy2 team.

(Side note, you might need to change the environment variable passed to your kernel for this to work. Here is what I had to do only once.)

In [15]:
#!a2km add-env 'python 3' DYLD_FALLBACK_LIBRARY_PATH=$HOME/anaconda/pkgs/icu-54.1-0/lib:/Users/bussonniermatthias/anaconda/pkgs/zlib-1.2.8-3/lib 
In [16]:
import rpy2.rinterface

%load_ext rpy2.ipython

The following will "send" the X and Y arrays to R.

In [17]:
%Rpush Y X

And now let's try to fit a linear model ($\ln(Y) = A \cdot X + B$) using R. I'm not an R user myself, so don't take this as idiomatic R.

In [18]:
%%R
my_summary = summary(lm(log(Y)~X))
val <- my_summary$coefficients

plot(X, log(Y))
abline(my_summary)
In [19]:
%%R
my_summary
Call:
lm(formula = log(Y) ~ X)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.183663 -0.013497 -0.004137  0.006046  0.296094 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.775851   0.026173  -29.64   <2e-16 ***
X            0.479757   0.001524  314.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06866 on 27 degrees of freedom
Multiple R-squared:  0.9997,	Adjusted R-squared:  0.9997 
F-statistic: 9.912e+04 on 1 and 27 DF,  p-value: < 2.2e-16

Good, we now have some statistics on the fit, which also looks good. And we were able not only to send variables to R, but to plot directly from R !

We are happy, as $F_n = \left[\frac{\phi^n}{\sqrt 5}\right]$, where $[\cdot]$ denotes rounding to the closest integer and $\phi = \frac{1+\sqrt 5}{2}$.
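A quick check of that claim in pure Python: taking logs of $F_n \approx \phi^n / \sqrt 5$ gives $\log F_n \approx n \log\phi - \log\sqrt 5$, so the slope fitted by R should be close to $\log\phi$:

```python
from math import log, sqrt

# Binet's approximation F_n ~ phi**n / sqrt(5) predicts a slope of
# log(phi) for the log-linear fit done in R above.
phi = (1 + sqrt(5)) / 2
print(round(log(phi), 4))  # 0.4812, close to the 0.4798 estimated by R
```

The small gap comes from the first few terms, where the approximation is least accurate.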

We can also look at the variables more carefully

In [20]:
%%R
val
              Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -0.7758510 0.026172673 -29.64355 3.910319e-22
X            0.4797571 0.001523832 314.83597 1.137181e-49

Or even the following, which looks more like Python:

In [21]:
%R val
Out[21]:
array([[ -7.75850975e-01,   2.61726725e-02,  -2.96435519e+01,
          3.91031947e-22],
       [  4.79757090e-01,   1.52383191e-03,   3.14835966e+02,
          1.13718145e-49]])

We can even get the variable back from R as Python objects:

In [22]:
coefs = %Rget val
y0,k = coefs.T[0]
y0,k
Out[22]:
(-0.77585097534858738, 0.4797570904348315)

That's all from the R part. I hope this shows you some of the power of IPython, both in notebook and command line.

CFFI

Great! We were able to send data back and forth! It does not work for all objects, but at least for the basic ones, and it requires quite some work from the authors of the underlying library to allow you to do that. We are still limited to data, though: we can't (yet) send functions over, which limits the utility.

Mix and Match : C

One of the critical points of any code may at some point be performance. Python is known to not be the most performant language, though it is convenient and quick to write and has a large ecosystem. Most of the functions you require are probably available in a package, battle-tested and optimized.

You might still need here and there the raw power of an ubiquitous language which is known for its speed when you know how to wield it well: C.

Though one of the disadvantages of C is the (relatively) slow iteration process due to the compile/run part of the cycle. Let's see if we can improve that by leveraging the excellent CFFI project, using my own small cffi_magic wrapper.

In [23]:
import cffi_magic
In [24]:
rm -rf *.o *.c *.so Cargo.* src target
In [25]:
ls *.c *.h *.o
ls: *.c: No such file or directory
ls: *.h: No such file or directory
ls: *.o: No such file or directory

Using the %%cffi magic we can define some C functions in the middle of our Python code:

In [26]:
%%cffi int cfib(int);

int cfib(int n)
{
    int res=0;
    if (n <= 1){
        res = 1;
    } else {
        res = cfib(n-1)+cfib(n-2);
    }
    return res;
}

The first line takes the "header" of the function we declare, and the rest of the cell is the body of this function. The cfib function will automatically be made available in the main Python namespace.

In [27]:
cfib(5)
Out[27]:
8

Oops, there is a mistake, as we should have fib(5) == 5. Luckily we can redefine the function on the fly. I could edit the above cell, but as this will be rendered statically, for the sake of the demo I'm going to make a second cell:

In [28]:
%%cffi int cfib(int);

int cfib(int n)
{
    int res=0;
    if (n <= 2){  /*mistake was here*/
        res = 1;
    } else {
        res = cfib(n-1)+cfib(n-2);
    }
    return res;
}
In [29]:
cfib(5)
Out[29]:
5

Great ! Let's compare the timing.

In [30]:
%timeit cfib(10)
The slowest run took 73.42 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 379 ns per loop
In [31]:
%timeit fib(10)
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 853 ns per loop

Not so bad, considering the C implementation is recursive while the Python version is manually unrolled.

Implementation detail

So how do we do that magic under the hood? The knowledgeable reader is aware that CPython extensions cannot be reloaded. Though here we redefine the function... how come?

Using the user-provided code, we compile a shared object with a random name, import it as a module, and alias it under a user-friendly name in the __main__ namespace. If the user re-executes the cell, we just get a new name and update the alias mapping.

To optimize, one could hash the cell's source and skip recompilation when the user hasn't changed the code.
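A minimal sketch of that caching idea (hypothetical names, not the actual cffi_magic code): derive the module name from a hash of the cell source, and only invoke the compiler when the source changes.

```python
import hashlib

_compiled = {}  # module name -> whatever compile_fn returned for it

def _module_name(source):
    """Derive a stable module name from the cell source."""
    return "_cffi_" + hashlib.sha1(source.encode()).hexdigest()[:10]

def compile_if_needed(source, compile_fn):
    """Call compile_fn(source, name) only if this exact source has not
    been compiled before; return the module name either way."""
    name = _module_name(source)
    if name not in _compiled:
        _compiled[name] = compile_fn(source, name)
    return name
```

Re-running an unchanged cell then reuses the cached shared object, while any edit yields a new name, a fresh compilation, and an updated alias in __main__.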

In [32]:
ls *.o *.c
_cffi_cWtAstIlGT.c  _cffi_cWtAstIlGT.o  _cffi_yxGAKIqXRR.c  _cffi_yxGAKIqXRR.o

With this in mind, you can guess the same can be done for any language that can be compiled to a shared object or a dynamically loadable library.

Mix and Match : rust

The cffi_magic module also allows you to do the same with Rust, a newer language designed at Mozilla, which provides the same C-like level of control while incorporating more recent ideas about programming and better memory safety. Let's see how we would do the same with Rust:

In [33]:
%%rust int rfib(int);

#[no_mangle]
pub extern fn rfib(n: i32) -> i32 {
    match n {
        0 => 1,
        1 => 1,
        2 => 1,
        _ => rfib(n-1)+rfib(n-2)
    }
}
injecting  rfib in user ns
In [34]:
[rfib(x) for x in range(1,10)]
Out[34]:
[1, 1, 2, 3, 5, 8, 13, 21, 34]

I'm not a Rustacean, but the above seems pretty straightforward to me. It might not be idiomatic Rust, but you should be able to decipher it. The same caveats as for C apply.

Still in development

Both the C and Rust examples above use cffi_magic, on which I spent roughly 4 hours total, so the functionality can be really crude and the documentation minimal at best. Feel free to send PRs if you are interested.

Fortran

The Fortran magic does the same as above; it was developed by mgaitan and is slightly older. Again, no surprises, except that you are expected to mark the Fortran variables used to return values with INTENT(OUT).

In [35]:
%load_ext fortranmagic
/Users/bussonniermatthias/anaconda/lib/python3.5/site-packages/fortranmagic.py:147: UserWarning: get_ipython_cache_dir has moved to the IPython.paths module since IPython 4.0.
  self._lib_dir = os.path.join(get_ipython_cache_dir(), 'fortran')
In [36]:
%%fortran
RECURSIVE SUBROUTINE ffib(n, fibo)  
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: n
    INTEGER, INTENT(OUT) :: fibo
    INTEGER :: tmp
    IF (n <= 2) THEN 
        fibo = 1
    ELSE
        CALL ffib(n-1,fibo)
        CALL ffib(n-2,tmp)
        fibo = fibo + tmp
    END IF
END SUBROUTINE ffib
In [37]:
[ffib(x) for x in range(1,10)]
Out[37]:
[1, 1, 2, 3, 5, 8, 13, 21, 34]

No surprise here: by now you know exactly what we are doing.

Cython

IPython used to ship with the Cython magic, which is now part of Cython itself. Cython is a superset of Python that compiles to C and is importable from Python. You should be able to take your Python code as-is, type-annotate it, and get C-like speed. The same principle applies:

In [38]:
import cython
In [39]:
%load_ext cython
In [40]:
%%cython

def cyfib(int n): # note the `int` here
    """
    A simple definition of fibonacci manually unrolled
    """
    cdef int x,y # and the `cdef int x,y` here
    if n < 2:
        return 1
    x,y = 1,1
    for i in range(n-2):
        x,y = y,x+y
    return y
In [41]:
[cyfib(x) for x in range(1,10)]
Out[41]:
[1, 1, 2, 3, 5, 8, 13, 21, 34]

Benchmark

In [42]:
%timeit -n100 -r3 fib(5)
100 loops, best of 3: 648 ns per loop
In [43]:
%timeit -n100 -r3 cfib(5)
The slowest run took 11.60 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 578 ns per loop
In [44]:
%timeit -n100 -r3 ffib(5)
100 loops, best of 3: 147 ns per loop
In [45]:
%timeit -n100 -r3 cyfib(5)
100 loops, best of 3: 45.6 ns per loop

The benchmark results can be astonishing, but keep in mind that the Python and Cython versions use a manually unrolled loop. The main point is that we reached our goal and used Fortran, Cython, C (and Rust) in the middle of our Python program.
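The recursive-versus-unrolled difference can be reproduced outside IPython with the stdlib timeit module; this is a sketch, not the post's original benchmark code:

```python
import timeit

def fib_rec(n):
    """Recursive Fibonacci, like the C and Fortran versions above."""
    return 1 if n <= 2 else fib_rec(n - 1) + fib_rec(n - 2)

def fib_loop(n):
    """Manually unrolled Fibonacci, like the Python and Cython versions."""
    if n < 2:
        return 1
    x, y = 1, 1
    for _ in range(n - 2):
        x, y = y, x + y
    return y

for f in (fib_rec, fib_loop):
    # best-of-3 average over 1000 calls, as %timeit would report
    best = min(timeit.repeat(lambda: f(10), number=1000, repeat=3)) / 1000
    print("%s: %.0f ns per call" % (f.__name__, best * 1e9))
```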

[Let's skip the Rust fib version; it tends to segfault, and it would be sad to segfault now :-) ]

In [46]:
# %timeit rfib(10)

The Cake is not a lie!

So can we make a layer cake? Can we call Rust from Python from Fortran from Cython? Or Cython from C from Fortran? Or Fortron from Cytran from Cust?

In [47]:
import itertools
lookup = {'c': cfib,
          # 'rust': rfib,  # as before, Rust may segfault, but I don't know why...
          'python': fib,
          'fortran': ffib,
          'cython': cyfib,
          }

print("Pray to the demo gods it won't segfault, even without Rust...")
Pray to the demo gods it won't segfault, even without Rust...
In [48]:
for function in lookup.values():
    assert function(5) == 5, "Make sure everything is correct, or we'll use 100% CPU for a looong time."
In [49]:
for order in itertools.permutations(lookup):
    t = 5
    for f in order:
        t = lookup[f](t)
    
    print(' -> '.join(order), ':', t)
fortran -> cython -> python -> c : 5
fortran -> cython -> c -> python : 5
fortran -> python -> cython -> c : 5
fortran -> python -> c -> cython : 5
fortran -> c -> cython -> python : 5
fortran -> c -> python -> cython : 5
cython -> fortran -> python -> c : 5
cython -> fortran -> c -> python : 5
cython -> python -> fortran -> c : 5
cython -> python -> c -> fortran : 5
cython -> c -> fortran -> python : 5
cython -> c -> python -> fortran : 5
python -> fortran -> cython -> c : 5
python -> fortran -> c -> cython : 5
python -> cython -> fortran -> c : 5
python -> cython -> c -> fortran : 5
python -> c -> fortran -> cython : 5
python -> c -> cython -> fortran : 5
c -> fortran -> cython -> python : 5
c -> fortran -> python -> cython : 5
c -> cython -> fortran -> python : 5
c -> cython -> python -> fortran : 5
c -> python -> fortran -> cython : 5
c -> python -> cython -> fortran : 5
In [50]:
print('It worked! I can run all the permutations!')
It worked! I can run all the permutations!

The Cherry on the Layer Cake, with Julia

If you have a rough idea of how the above layer cake works, you'll understand that there is (still) a non-negligible overhead, as between each language switch we need to go back to Python-land, and the scope in which we can access functions is still quite limited. The following is some really Dark Magic concocted by Fernando Perez and Steven Johnson using the Julia programming language. I can't even pretend to understand how this is possible, but it's really impressive to see.

Let's try to handwave what's happening. I would be happy to get corrections.

The crux is that the Python and Julia interpreters can be started in such a way that each has access to the other's memory. The Julia and Python interpreters can thus share live objects. You then "just" need to teach Julia about the structure of Python objects, and it can manipulate them as desired, either directly (if the memory layout allows it) or through proxy objects that "delegate" the functionality to the Python process.
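The "proxy object" half of that story can be illustrated with a toy Python class (purely illustrative; PyCall's real machinery is far more involved): attribute lookups are simply forwarded to an object that conceptually lives in the other runtime.

```python
import math

class Proxy:
    """Toy proxy: forward attribute access to a wrapped target object."""
    def __init__(self, target):
        # In PyCall this would be a handle into the other runtime's memory.
        self._target = target

    def __getattr__(self, name):
        # __getattr__ only fires for names not found on the proxy itself,
        # so every other lookup is delegated to the target.
        return getattr(self._target, name)

m = Proxy(math)   # pretend `math` lives in another interpreter
m.sqrt(16)        # → 4.0, resolved through the proxy
```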

The result is that Julia can import and use Python modules (via the Julia PyCall package), and Julia functions are available from within Python via the pyjulia module.

Let's see what this looks like.

In [51]:
%matplotlib inline
In [52]:
%load_ext julia.magic
Initializing Julia interpreter. This may take some time...
In [53]:
julia_version = %julia VERSION
julia_version # you can see this is a wrapper
Out[53]:
<PyCall.jlwrap 0.5.0>

Here we tell the Julia process to import the Python matplotlib module, as well as numpy.

In [54]:
%julia @pyimport matplotlib.pyplot as plt
In [55]:
%julia @pyimport numpy as np
In [56]:
%%julia
                                        # Note how we mix numpy and julia:
t = linspace(0, 2*pi,1000);             # use the julia `linspace` and `pi`
s = sin(3*t + 4*np.cos(2*t));           # use the numpy cosine and julia sine
fig = plt.gcf()                         # **** WATCH THIS VARIABLE ****
plt.plot(t, s, color="red", linewidth=2.0, linestyle="--", label="sin(3t+4.cos(2t))")
Out[56]:
[<matplotlib.lines.Line2D at 0x327caeb70>]

The whole block of code above is Julia, where linspace, pi, and sin are Julia builtins, while np.* and plt.* reference Python functions and methods.

We see that t is a Julia array (technically a 1000-element LinSpace{Float64}) which can be sent to numpy.cos, multiplied by a Julia int, etc., and end up plotted via matplotlib (Python) and displayed inline.

Let's finish our graph in Python:

In [57]:
import numpy as np
fig = %julia fig
fig.axes[0].plot(X[:6], np.log(Y[:6]), '--', label='fib')
fig.axes[0].set_title('A weird Julia function and Fib')
fig.axes[0].legend()

fig
Out[57]:

Above we get a reference to our previously defined figure (from Julia) and plot the log of our fib function. The key point here is that we get the same object from within Python and Julia. But let's push even further.

Above we had explicit transition between the Julia code and the Python code. Can we be more sneaky?

One toy example is to define the Fibonacci function in its recursive form and explicitly pass in the function with which to recurse.

We'll define such a function on both the Julia and Python sides, ask the Julia function to recurse by calling the Python one, and the Python one to recurse using the Julia one.

Let's print (P when we enter the Python Kingdom and (J when we enter the Julia Realm, closing the parentheses accordingly:

In [58]:
from __future__ import print_function


# julia fib function
jlfib = %julia _fib(n, pyfib) = n <= 2 ? 1 : pyfib(n-1, _fib) + pyfib(n-2, _fib)


def pyfib(n, _fib):
    """
    Python fib function
    """
    print('(P', end='')
    if n <= 2:
         r = 1
    else:
        print('(J', end='')
        # here we tell julia (_fib) to recurse using Python
        r =  _fib(n-1, pyfib) + _fib(n-2, pyfib)
        print(')',end='')
    print(')',end='')
    return r
In [59]:
fibonacci = lambda x: pyfib(x, jlfib)

fibonacci(10)
(P(J(P(J(P(J(P(J(P)(P)))(P(J))(P(J))(P)))(P(J(P(J))(P)(P)(P)))(P(J(P(J))(P)(P)(P)))(P(J(P)(P)))))(P(J(P(J(P(J))(P)(P)(P)))(P(J(P)(P)))(P(J(P)(P)))(P(J))))(P(J(P(J(P(J))(P)(P)(P)))(P(J(P)(P)))(P(J(P)(P)))(P(J))))(P(J(P(J(P)(P)))(P(J))(P(J))(P)))))
Out[59]:
55
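No Julia is required to understand the ping-pong itself; here is the same pattern with two pure-Python functions (hypothetical names), each recursing through the other one passed as an argument:

```python
def fib_a(n, other):
    """Recurse by handing control to `other`, passing ourselves along."""
    return 1 if n <= 2 else other(n - 1, fib_a) + other(n - 2, fib_a)

def fib_b(n, other):
    """Identical logic; stands in for the Julia _fib."""
    return 1 if n <= 2 else other(n - 1, fib_b) + other(n - 2, fib_b)

fib_a(10, fib_b)  # → 55, alternating between the two at every level
```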

Cross language is Easy

I hope you enjoyed this; I find it quite interesting and useful when you need to leverage tools from multiple domains. I'm sure there are plenty of other tools that allow this kind of thing, and a host of other languages that can interact with each other in this way.

Off the top of my head, I know of a few magics (SQL, Redis...) that provide this kind of integration. Every language has its strong and weak points, and knowing what to use is often hard. I hope I convinced you that mixing languages is not such a daunting task.

The other case where this is useful is when you are learning a new language: you can leverage your current expertise temporarily and get something that works before learning the idiomatic way and the available libraries.

Comments

If you have comments suggestions please open an issue on GitHub

Cat Tax

Here is a Fibonacci cat to thank you for reading until the end. Sorry, no banana for scale.

In [60]:
from IPython.display import Image
print('Pfiew')
Image('http://static.boredpanda.com/blog/wp-content/uploads/2016/02/fibonacci-composition-cats-furbonacci-91__700.jpg')
Pfiew
Out[60]:

One less Pull Request Followup

My earlier blog post could not have been more timely; here is what I received this morning:

Hacktoberfest is back! Ready to hack away?

It’s that time of year again! Hacktoberfest 2016 is right around the corner and we’re back with new, featured projects and the chance to win the limited-edition Hacktoberfest T-shirt you all love.

Read the blog post to see what’s changed this year and share on your favorite social media networks with #Hacktoberfest.

Community Feedback.

I quickly got some feedback, in particular from Aaron Meurer. First, there is no comment box on this blog. I tried one; it's painful to maintain, needs moderation, etc. So you can ping me on Twitter, or open a GitHub issue. I think it's a high enough filter: if you have something to say about what I write here, you (likely) already have a GitHub account.

Also, I ran a poll on Twitter. Of 41 responses, 51% (I assume 21 users) prefer few open PRs, while 49% (20 users) don't really pay attention to the count. So there is definitely a non-negligible population that will still be OK with contributing. This also explains Aaron's tweet:

if the number of pull requests discouraged pull requests we wouldn't have so many pull requests

I would argue that if 50% of your users are not discouraged, you still discourage the other half, which might be fine according to your own metrics.

How to close?

Aaron expressed his concern that closing a PR without a comment might send the unintended message that:

  • The maintainer does not want your code (assuming the maintainer closes)
  • The author no longer wants the project to get their code (assuming the author closes)

It might not have been clear enough in the earlier post, but when closing, please explain why you are closing and what you expect. Here is one example from the IPython repository where the maintainers closed the PR:

@takluyver and @ellisonbg have decided that we are going to close this PR and open an issue. We are still interested in this work going in, but the tests need to be written first. Feel free to re-open when the tests are ready.

If you are closing your own work, please make it clear:

  • Whether you plan to work on it later
  • Whether what you did can be reused by future developers

When not to close

Aaron pointed out again that, as a maintainer, he prefers to keep PRs open. Now that GitHub lets you give maintainers the ability to push to your branch, you can grant them that access on the PR. If you disagree about whether the PR should be closed, talk it over with the maintainer. The commits of a closed PR can still be accessed, and perhaps the best course of action is for the maintainer to fork your branch and re-issue the PR; they will have more control over it. Regardless, discuss with the maintainer and convey your intent.

Look deeper into maintainers' habits.

I purposely avoided the subject, but Andreas Mueller pointed out that you can actually look at recently merged PRs and the commit history. Only if all PRs are old is Andreas discouraged.

This is a valid strategy, but it requires time from the person who wants to submit the PR, and it's not always easy to do. I tend to go this extra step only when I really think the PR is worth it.

Things are subjective

Again, all these are personal thoughts and preferences. I prefer to have few open PRs, like I would prefer to have inbox-zero. I would be curious to see an analysis of the type of contributions versus the number of open PRs. Are novice users less likely to contribute depending on the number of open PRs? Are the network structures across projects different?

Happy HacktoberFest.

One less Pull Request

It's that time of the year again: soon many websites and organisations will push you to contribute to open source, for example via Hacktoberfest (I got a nice T-shirt last year), and 24pullrequests seems to gain traction each year as well. These are really nice incentives that push users of open source to start contributing, and already-seasoned developers to try new projects.

Here is a request I have for you, whether or not you participate in these events: please close a pull request.

Less is More

While I really appreciate new contributions, there is a point where too many open pull requests can, I think, be harmful. I'm going to lay out the various cases, why I think they are harmful, and what can be done.

Here are two specific examples: the SymPy project (as Aaron feels targeted), whose authors are absolutely extraordinary and responsive, currently has 378 open PRs, and Matplotlib is apparently at 207. You can see in the discussion linked here that maintainers feel differently about a high number of PRs.

I open too many pull requests

I currently have 12 open pull requests (see how many you have). This means that I have to follow up with around 12 projects every day, which carries an extremely high cognitive cost of switching. I try not to keep a PR open for more than 6 months; if it's older, it's most likely not going to be merged or taken care of by the maintainers. Every time I get to this screen I spend at least 30 seconds wondering what to do about old PRs.

My advice is to stay focused: if you are not going to work on a pull request, let the maintainers know by closing it. It can still be reopened. You might want to leave a message explaining why you are not working on it, and whether you would be happy (or not) for someone else to take over.

I'm now back to 8. It fits on one screen, I can be more focused.

Also, if you are a maintainer and know a pull request will likely not get merged, I would prefer you don't give me false hope: close it and explain why, even if it's just that you are busy with something else and would appreciate it being resubmitted later. I'm more likely to get over it and try a few more times than if my first contribution got no response.

I receive too many pull-requests

I strongly encourage you to try minrk.github.io/all-my-pulls: it lets you view all the pull requests you have the ability to merge, and filter out repositories you do not wish to see. After filtering, I have 61 pull requests across 19 repos. That is also too much to stay focused.

Many of these pull requests have stalled, and I would gladly appreciate the authors closing them if they have no intention of working on them. To be honest, many of the oldest pull requests have entered this "awkward state" where I want to close them but don't actually do so, because it can be rough for authors to see their work dismissed.

As a maintainer, I should do a better job of saying when a pull request has stalled and is just polluting the PR list: close it with a nice explanation. It's always possible to reopen it if needed. GitHub allows canned responses; I use one as a template that lays out the PR-closing policy. I've found that having a clear policy often makes decisions easier. And sometimes closing even allows work to be resubmitted, to appear at the top of the pile and start anew.

There is also the possibility of taking over the author's work and finishing it up in a separate PR, or pushing directly to the author's fork if they allow it. I personally rarely do that, as I feel it is a slippery slope toward the maintainer doing everything.

I find myself much more efficient when there are only 5 to 6 open pull requests. I can keep track of each of them, judge whether the work will conflict, and give proper care to each. I fail to do so when there are many pages.

I don't contribute to repositories that have too many PRs.

When I come across a repository with more than 20-ish open pull requests, I tend to think that the authors are not responsive, so why bother contributing? I know these are often only impressions, and I can get over them because I often happen to know the maintainers. That feeling is hard to get over, though, on repositories I'm new to.

With a high number of open PRs, I also tend to be discouraged from searching for whether someone is already fixing the bug I saw or implementing the feature I wish for. Moreover, the higher the number of open PRs, the longer it will likely take the maintainers to review mine, and the higher the chance I will need to rebase my work, which, whether or not you are a git master, can be a painful process to go through (and to ask someone to go through).

I'm pretty certain I'm not the only one discouraged by a large number of open, inactive pull requests. I asked on Twitter, and it looks like roughly every other respondent is discouraged from contributing if too many PRs are open.

What do you think?

The above paragraphs are my thoughts on too many open pull requests. How do you feel about that? As you might have read in the Twitter conversation linked above, different people have different opinions.

If you want to comment, please open an issue on GitHub, and if you have the courage to help improve my English, feel free to send me a PR (sic) to make this more readable.

Close a PR!

Thank you for reading this far! If you want to restore part of some maintainers' sanity, or to appeal a bit more to some users, please go close a PR! Or help finish a PR that has stalled! I can't give you a free T-shirt like Hacktoberfest, but feel free to tweet with the hashtag #IClosedAPR!